HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization#6527
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to enable and validate Parquet DECIMAL_64 vectorization (including for Iceberg tables) by advertising DECIMAL_64 feature support from Parquet/Iceberg input formats, adding a Decimal64 read path to the Parquet vectorized reader, and extending test coverage and golden outputs to assert the new vectorization behavior.
Changes:
- Advertise
DECIMAL_64as a supported vectorization feature for Parquet (and Iceberg when enabled via table properties). - Add a
Decimal64ColumnVectorread path to the Parquet vectorized primitive column reader (including dictionary decode support). - Add/adjust LLAP query tests and Java unit tests to exercise
DECIMAL_64vectorization and update expected outputs.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| ql/src/test/results/clientpositive/llap/parquet_decimal64.q.out | New golden output validating Parquet DECIMAL_64 vectorization plan + results. |
| ql/src/test/queries/clientpositive/parquet_decimal64.q | New query to assert Parquet vectorized reader engages DECIMAL_64. |
| ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java | Adds wiring for physical variations + a Decimal64 read verification helper. |
| ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedDictionaryEncodingColumnReader.java | Adds a unit test entry point for Decimal64 reads (dictionary). |
| ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java | Adds a unit test entry point for Decimal64 reads (non-dictionary). |
| ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java | Implements Parquet Decimal64 read + dictionary decode logic. |
| ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java | Advertises DECIMAL_64 support for Parquet vectorization. |
| iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_parquet.q.out | Updates plan/output expectations to show DECIMAL_64 in use for Iceberg Parquet. |
| iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_multitable.q.out | Updates expected output (removes now-unneeded ORC-only toggling sections). |
| iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_mixed.q.out | Updates plan/output expectations for mixed ORC/Parquet Iceberg vectorization. |
| iceberg/iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_multitable.q | Removes ORC-only toggling queries that are no longer required. |
| iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java | Removes ORC-only parameter propagation used for prior vectorization gating. |
| iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java | Enables advertising DECIMAL_64 for Iceberg when table property is enabled (regardless of ORC/Parquet). |
| iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/BaseHiveIcebergMetaHook.java | Removes ORC-only parameter constant + helper methods used for vectorization gating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e474e0b to
49ec3e4
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 105 out of 105 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/Decimal64ColumnVector.java:118
enforceScaleAndSetwritesvector[elementNum]on success but never clearsisNull[elementNum]. If the caller left a stale NULL flag (or initially marked the entry null before callingset), the value will be stored but still treated as NULL downstream. Sinceset(int, ...)is advertised as a fast path that assumes theisNullentry is already set, it should explicitly setisNull[elementNum] = falsewhen the value is in range.
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 106 out of 106 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (5)
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:1
- This uses Java pattern matching for
instanceof(column instanceof Decimal64ColumnVector dec64), which will not compile on repositories building with Java < 16 (Hive commonly targets Java 8/11). Replace with a classicinstanceofcheck followed by an explicit cast and separate variable declaration.
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:1 - This also uses Java pattern matching with a bound variable in
instanceof(hiveType instanceof DecimalTypeInfo hiveDti), which is not supported before Java 16. Use(hiveType instanceof DecimalTypeInfo) ? (DecimalTypeInfo) hiveType : getDecimalTypeInfo()to keep compatibility with older Java targets.
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java:1 - Using a fixed filename under
java.io.tmpdircan still collide when tests are run concurrently (e.g., parallel surefire/failsafe, multiple forks, or repeated runs on the same agent), causing flaky failures. Prefer generating a unique per-test path (e.g., UUID-based) and/or using the test framework’s temp directory utilities.
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java:1 BigInteger.toByteArray()can return a byte array longer than the requested fixed length (e.g., requiring an extra leading sign byte), which will makelen - minimal.lengthnegative and throwArrayIndexOutOfBoundsException. Add a guard to handleminimal.length > len(either trim a leading sign byte when safe, or fail with a clear assertion since this is test code).
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java:1- This allocates a new array on every call. Consider returning a
private static final VectorizedSupport.Support[]constant to avoid repeated allocations on hot paths.



What changes were proposed in this pull request?
Adds support for Parquet DECIMAL_64 vectorization
Why are the changes needed?
Performace
Does this PR introduce any user-facing change?
No
How was this patch tested?
parquet_decimal64.q, vectorized_iceberg_read_multitable.q